from dateutil.parser import parse
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import plotly.express as px
mpl.rcParams['axes.grid']=False
plt.rcParams.update({'figure.figsize': (10, 7), 'figure.dpi': 120})
df=pd.read_csv('shampoo_sales.csv')
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 36 entries, 0 to 35 Data columns (total 2 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Month 36 non-null object 1 Sales 36 non-null float64 dtypes: float64(1), object(1) memory usage: 704.0+ bytes
df.shape
(36, 2)
df.head()
| Month | Sales | |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
| 2 | 1-03 | 183.1 |
| 3 | 1-04 | 119.3 |
| 4 | 1-05 | 180.3 |
df.rename(columns={'Month':'Monthly Sales','Sales':'Frequency of Shampoo sold'}, inplace=True)
df.head()
| Monthly Sales | Frequency of Shampoo sold | |
|---|---|---|
| 0 | 1-01 | 266.0 |
| 1 | 1-02 | 145.9 |
| 2 | 1-03 | 183.1 |
| 3 | 1-04 | 119.3 |
| 4 | 1-05 | 180.3 |
***Time Series Chart using Plotly library
fig = px.line(df.reset_index(), x ='Monthly Sales', y = 'Frequency of Shampoo sold', title = 'Total sales from Jan - Mar 2021')
fig.update_xaxes(
rangeslider_visible = True,
rangeselector = dict(
buttons = list([
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(count=3, label="3y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
***Not all days are evenly represented in the dataframe and we will like to forcast the sales of the shampoo with respect to the consequent months there is a need for resampling.
**The central tendency measure MEAN is employed to generate the missing days in the three months. Firstly, represent the missing days with NaN.
from datetime import datetime
def parser(x):
return datetime.strptime('202'+x, '%Y-%m')
df_ss = pd.read_csv('shampoo_sales.csv', header = 0, index_col = 0, parse_dates = True,
squeeze = True, date_parser = parser)
upsampled_df = df_ss.resample('D').mean()
upsampled_df.head(10)
Month 2021-01-01 266.0 2021-01-02 NaN 2021-01-03 NaN 2021-01-04 NaN 2021-01-05 NaN 2021-01-06 NaN 2021-01-07 NaN 2021-01-08 NaN 2021-01-09 NaN 2021-01-10 NaN Freq: D, Name: Sales, dtype: float64
We have been able to generate the missing values using MEAN, then we need to intepolate the missing values since the methods below:
[A] Up-sampling frequency method {ie, generating daily data from monthly data}.
We can interpolate the missing values at this new frequency. The function, interpolate() of pandas library is used to interpolate the missing values. We use a linear interpolation which draws a straight line between available data, on the first day of the month(2021-01-01) and fills in values at the chosen frequency from this line.
***Types of Interpolation: Linear interpolation; polynominal/spline interpolation; quadratic interpolation; nearest interpolation; slinear interpolation; zero interpolation; cubic interpolation.
fig = px.line(upsampled_df.interpolate(method = 'linear'))
fig.update_xaxes(
rangeslider_visible = True,
rangeselector = dict(
buttons = list([
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(count=3, label="3y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
#Interpolate using Linear type - interpolation replaces missing values with linearly-spaced values between the two nearest defined data points.
interpolated = upsampled_df.interpolate(method = 'linear')
interpolated.head(10)
Month 2021-01-01 266.000000 2021-01-02 262.125806 2021-01-03 258.251613 2021-01-04 254.377419 2021-01-05 250.503226 2021-01-06 246.629032 2021-01-07 242.754839 2021-01-08 238.880645 2021-01-09 235.006452 2021-01-10 231.132258 Freq: D, Name: Sales, dtype: float64
***Interpolation using Spine or polynomial type - interpolation fits a cubic polynomial to the points around the missing values. This is a painfully slow method that usually gives best results.
***Spine interpolation: This creates more curves and look more natural on many datasets.Using a spline interpolation requires you specify the order (count of terms in the polynomial); we use 2.
fig = px.line(upsampled_df.interpolate(method = 'spline', order = 2))
fig.update_xaxes(
rangeslider_visible = True,
rangeselector = dict(
buttons = list([
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(count=3, label="3y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
interpolated1 = upsampled_df.interpolate(method = 'spline', order = 2)
interpolated1.head(10)
Month 2021-01-01 266.000000 2021-01-02 258.630160 2021-01-03 251.560886 2021-01-04 244.720748 2021-01-05 238.109746 2021-01-06 231.727880 2021-01-07 225.575149 2021-01-08 219.651553 2021-01-09 213.957094 2021-01-10 208.491770 Freq: D, Name: Sales, dtype: float64
#NEAREST INTERPOLATION USING PLOTLY CHART
fig = px.line(upsampled_df.interpolate(method = 'nearest'))
fig.update_xaxes(
rangeslider_visible = True,
rangeselector = dict(
buttons = list([
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(count=3, label="3y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
#Nearest interpolation: It interpolation replaces missing values with the previous defined value.
interpolated3 = upsampled_df.interpolate(method = 'nearest')
interpolated3.head(10)
Month 2021-01-01 266.0 2021-01-02 266.0 2021-01-03 266.0 2021-01-04 266.0 2021-01-05 266.0 2021-01-06 266.0 2021-01-07 266.0 2021-01-08 266.0 2021-01-09 266.0 2021-01-10 266.0 Freq: D, Name: Sales, dtype: float64
#QUADRATIC INTERPOLATION USING PLOTLY CHART
fig = px.line(upsampled_df.interpolate(method = 'quadratic'))
fig.update_xaxes(
rangeslider_visible = True,
rangeselector = dict(
buttons = list([
dict(count=1, label="1y", step="year", stepmode="backward"),
dict(count=2, label="2y", step="year", stepmode="backward"),
dict(count=3, label="3y", step="year", stepmode="backward"),
dict(step="all")
])
)
)
fig.show()
#Quadratic and spline interpolation have slight differences in terms of the cummulative data.
interpolated4 = upsampled_df.interpolate(method = 'quadratic')
interpolated4.head(10)
Month 2021-01-01 266.000000 2021-01-02 258.653962 2021-01-03 251.539381 2021-01-04 244.656256 2021-01-05 238.004587 2021-01-06 231.584374 2021-01-07 225.395618 2021-01-08 219.438318 2021-01-09 213.712474 2021-01-10 208.218086 Freq: D, Name: Sales, dtype: float64
NOTE***Quadratic and spline interpolation have slight differences in terms of the cummulative data.
***Conclusion: We have been able to forecast the sales of shampoo for 2022 and 2023. The data displayed on the charts are monthly data from January to December for 2021, 2022 and 2023.
[B] ***Two Ways of Down-sampling Frequency Down-sampling Frequency {ie, generating quartely data from monthly data}.
The Shampoo sales data presented monthly, but we prefer the data to be quarterly. The year can be divided into 4 business quarters, 3 months a piece. Then, the resample() function will group all observations by the new frequency. We need to decide how to create a new quarterly value from each group of 3 records. We shall use the mean() function to calculate the average monthly sales numbers (shampoo) for the quarter.
***Down-sampling Frequency {ie, generating yearly data from monthly data}.
We can turn monthly data into yearly data. Down-sample the data using the alias, 'A' for year-end frequency and this time use sum to calculate the total sales each year.
#FROM QUARTERLY TO YEARLY RESAMPLING CHART USING MATPLOTLIB
resample = df_ss.resample('Q')
quarterly_mean_sales = resample.mean()
quarterly_mean_sales.plot()
plt.show()
#Quarterly segementation of the data and forecast from 2021 to 2023. The sum of three months divided by the total number of days.
quarterly_mean_sales.head(10)
Month 2021-03-31 198.333333 2021-06-30 156.033333 2021-09-30 216.366667 2021-12-31 215.100000 2022-03-31 184.633333 2022-06-30 250.566667 2022-09-30 273.166667 2022-12-31 342.800000 2023-03-31 365.333333 2023-06-30 426.000000 Freq: Q-DEC, Name: Sales, dtype: float64
#YEARLY TO MONTHLY RESAMPLING USING MATPLOTLIB
resample = df_ss.resample('A')
yearly_mean_sales = resample.mean()
yearly_mean_sales.plot()
plt.show()
#Yearly segementation of the data and forecast from 2021 to 2023. The sum of three quarters divided by 4.
yearly_mean_sales.head()
Month 2021-12-31 196.458333 2022-12-31 262.791667 2023-12-31 478.550000 Freq: A-DEC, Name: Sales, dtype: float64
***Conclusion: From the interpolation the missing days were generated and aid in forecasting the remaining days, months and year{2022,2023}.